Analysis of RNA-seq data from a neurexin-deficient worm

Introduction

The RNA-seq data in this project originates from sequencing done at the Genomics Centre of QMUL. The samples came from Prof. Filipe Cabreiro's lab (then UCL, now Imperial) from normal and neurexin-deficient worms - the work was mostly carried out by Peter Cooke (then Phd student). Other members in Filipe's lab had helped out at the time with supervision and RNA extraction. The sequencing centre did not achieve good ribosomal depletion as it transpired that they did not have the right kit for C.elegans. Hence, only medium/high expressed genes are seen in these data and ribosomal RNA reads must be removed before analysing the data further.

Aine did a differential gene expression analysis based on counting reads using HTseq count and doing differential expresion with DESEq2 and limma. However, if I remember well, she missed out ~ half the reads due to a problem in one of her shell scripts (when pooling together the lanes, I think she pulled only two out of the four available).

It is now recognised that gene expression analysis should be based on aggregating the results of transcript-based analysis. Below is a list of links that describe how this should be done:

The data

The data comprise 64 FASTQ files in total. There are 8 samples, each split into 4 lanes and each being sequenced in paired-end mode, hence there are: 8 x 4 x 2= 64 files.All data is from reversely-stranded libraries i.e. read 2 is in the same direction as the transcript on the genome. - 4 normal samples, names starting with "N" (biological replicates) - 4 neurexin-deficient samples, names starting with "SG" (biological replicates).

All samples originated from 1-day old adult worms.

Below is an outline of the steps to analyse the RNA-seq data from Aine's project, starting with raw fastq data and ending with mapped reads to the C. elegans transcriptome using kallisto.

All code was run on the departmental server thoth. Where the variable $PROJECT is used below it refers to the following directory: /d/in16/u/ubcg71a/research/filipe/ .

Data quality analysis with FastQC/multiQC

Initial fastqc data quality analysis on raw data was carried out by both Aine and repeated by Krzysztof at some point. Results can be found here: /d/in13/u/kszkop01/worm_neurexin/fastqc/

Had I run it myself I would have used my bash script run_fastqc.sh for running fastQC on multiple files:

#!/bin/bash
# Runs FASTQC on all fastq.gz files found in given directory
# Run as:
# ./run_fastqc dir_of_fastq_files

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

FASTQC_DIR="/s/software/fastqc/v0.11.8/FastQC/fastqc"

echo "Running FASTQC using executable: $FASTQC_DIR"

for fastq_file in $1/*.fastq.gz;
do
  echo "Fastq file: $fastq_file"
  $FASTQC_DIR $fastq_file -o .
done

I summarised K's FASTQC results using multiQC:

module use -s /s/mm/modules
module load python/v2
multiqc /d/in13/u/kszkop01/worm_neurexin/fastqc/

Results are under: $PROJECT/multiqc_on_raw/

All samples have sequences of a single length (76 bp).

Diagnostic plots from FASTQC summarised with multiQC (in all these plots, files with R1 are coloured red, files with R2 are blue):

Alt text Alt text Alt text Alt text

Alt text

Alt text

Alt text

Alt text

Observations from initial quality analysis

  • Quality scores are generally very high even for the last few positions of the read; in general R1 (shown in red in the plots above) slightly better than R2 (shown in blue)
  • There are high levels of duplication (most likely rRNA)
  • There are some Ns mostly at the beginning of the reads
  • GC content shows strange distribution with differences between R1 and R2 (I've seen this before in datasets; maybe bias from the bias due to adapters still present?). GC of C. elegans is supposed to have mean base composition of 36% GC (with the rRNA cistrons being closer to 51%; ref: PMID: 4858229). Our plots show higher GC content in all cases but there is a peak where rRNA should be.
  • There is evidence of adapters present in the 3' end

Read pre-processing with fastp

I trimmed (adapters and poly(A) tails) and quality-filtered the reads using the program fastp :

Raw fastq data is under: $PROJECT/data/Data/170519_NS500784_0235_AHT2JYBGX2/Demulti/Data/Intensities/BaseCalls/GC-IN-6725/

To remove poly(A) tails and adapters, I ran the script under $PROJECT/fastp:

nohup ./run_fastp.sh ../data/Data/170519_NS500784_0235_AHT2JYBGX2/Demulti/Data/Intensities/BaseCalls/GC-IN-6725/ >& run_fastp.out &

Below is the run_fastp.sh script:

#!/bin/bash
# Runs fastp in PE mode for all .fastq.gz files in a directory
# Run as:
# ./run_fastp.sh directory_of_samples 

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

FASTP_EXEC="/d/in7/s/fastp/fastp"
DIR=$1
ADAPTER_FILE="./adapters_all.fa"

echo "Running fastp using executable: $FASTP_EXEC"

for file in `ls $DIR/*_R1_*.fastq.gz`;
do
  sample=${file/$DIR\//}
  sample=${sample/_R1_001.fastq.gz/}
  echo "Sample= $sample"
  $FASTP_EXEC -i  "$DIR/$sample"_R1_001.fastq.gz  \
              -I  "$DIR/$sample"_R2_001.fastq.gz \
         -o "$sample"_R1_001_trimmed.fastq.gz \
         -O "$sample"_R2_001_trimmed.fastq.gz \
         --adapter_fasta $ADAPTER_FILE \
         -l 30 --trim_poly_x \
         -h "$sample"_fastp.html -j "$sample"_fastp.json
done

NOTE: fastp by default uses the following quality filtering options:

  • 40% of bases are allowed to be at phred quality Q<15 (unqualified)
  • reads with at least 5 Ns are discarded
  • average quality of the read is not checked by default

Data quality re-analysis with FastQC/multiQC following fastp

Re-ran fastQC with the files in the fastp directory: ./run_fastp.sh ../fastp Results in the directory: $PROJECT/fastqc_after_fastp

Then re-ran multiqc and moved the .html file locally to save the plots shown below:

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Alt text

Results are under $PROJECT/fastqc_after_fastp/multiqc

Observations from re-analysis of quality

  • There are still Ns mostly at the beginning of the reads but these are only in small percentage of reads and only in the first 1-3 bases so will leave them in.
  • GC content shows the same strange distribution hence this has nothing to do with adapters and either to do with biases in the library creation or, more likely, the rRNA that is still present.
  • Adapters have been successfully removed.

Merging data from the 4 lanes to have one file per sample

Seeing that there were no strong lane effects, I merged all 4 lanes for each sample (this could have been done following mapping but there really was no evidence here to suggest a lane effect). The new files are in the $PROJECT/fastp_merged directory and the script run_fastq_merger.sh is given below:

#!/bin/bash
# Merges all fastq files for the same sample and same read orientation into single fastqc 
# Run as:
# ./run_fastq_merger.sh directory_of_fastq_files sample_name sample_number

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

MERGED_DIR="/d/in16/u/ubcg71a/research/filipe/fastp_merged/"

cd $1
sample=$2
samplenum=$3
echo $(pwd)
echo Merging for sample $sample 
cat "$sample"*"$samplenum"*R1_001*fastq.gz > $MERGED_DIR/"$sample"_"$samplenum"_R1.fastq.gz
cat "$sample"*"$samplenum"*R2_001*fastq.gz > $MERGED_DIR/"$sample"_"$samplenum"_R2.fastq.gz
./run_fastq_merger.sh ../fastp/ N1 S5
./run_fastq_merger.sh ../fastp/ N2 S6
./run_fastq_merger.sh ../fastp/ N3 S7
./run_fastq_merger.sh ../fastp/ N4 S8
./run_fastq_merger.sh ../fastp/ SG1 S1
./run_fastq_merger.sh ../fastp/ SG2 S2
./run_fastq_merger.sh ../fastp/ SG3 S3
./run_fastq_merger.sh ../fastp/ SG4 S4

Mapping processed reads to the genome with STAR

Before carrying out the analysis with kallisto/sleuth, we will map the reads using STAR and carry out some exploratory analysis of the data.

Aine's work showed a huge number of reads mapping to rRNA.Here, we will allow them to map and exclude them later. During the kallisto run, we will remove them from the transcriptome before mapping to save hassle.

First, we need to create an index of the genome with STAR:

module load star/v2.7

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir . --genomeFastaFiles ../Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --sjdbGTFfile ../Caenorhabditis_elegans.WBcel235.99.gtf --genomeSAindexNbases 12
#(note that parameter genomeSAindexNbases must be scaled down to 12 from the default 14 because the genome
#is smaller than the human genome)

#The genome index is currently under: /d/in16/u/ubcg71a/research/genomes/c.elegans/ensembl_99/STAR_INDEX/

#Once the genome index is created we can map the files using a bash script:
nohup ./run_star.sh ../fastp/ list_files > & run_star.out &
#where list_files contains a list of all samples
#N1_S5
#N2_S6
#N4_S8
#N3_S7
#SG1_S1
#SG2_S2
#SG3_S3
#SG4_S4

The script run_star.sh is given below:

#!/bin/bash

# Runs STAR mapping on all samples in a given directory 
# Samples must end in .fastq.gz and are given as a list in a file (second argument)

# Run as:
# run_star.sh directory_of_fastq_files list_of_samples 

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

dir=$1
shift

#set location of executables
STAR_EXEC="/s/software/STAR/bin/Linux_x86_64/STAR"

numProc=5   #number of processors/threads to be used
genomeIndex="/d/in16/u/ubcg71a/research/genomes/c.elegans/ensembl_99/STAR_INDEX/" #directory for genome index files 

for sample in `cat $1`;
do
   echo "Running STAR on sample $sample (paired-end reads) ..."

   inputFile1="$dir$sample"_R1.fastq.gz
   inputFile2="$dir$sample"_R2.fastq.gz

   STAR --runThreadN $numProc \
        --runMode alignReads \
        --outSAMtype BAM SortedByCoordinate \
        --readFilesCommand zcat \
        --genomeDir $genomeIndex \
        --outFileNamePrefix "$sample"_ \
        --readFilesIn $inputFile1 $inputFile2 \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts

done

echo "All done!"

STAR produces separate files with the unmapped reads (which we can use, for example, to map to circular RNA transcripts later) and it also produces gene counts. Ideally one should rely on transcript counts for differential gene expression; here we will use gene counts for the exploratory part of the analysis only.

I summarised the STAR output by running multiqc again. Results are under $PROJECT/star_mapping/multiqc/

Alt text Alt text Alt text

Exploratory data analysis with STAR gene counts

Set up libraries and functions

I start by defining my own version of DESeq2's plotPCA() function so that I can define which PCs I will be plotting

#define my own pcaplot to allow other components besides PC1 and PC2 to be plotted
#the function itself is copied from DESEq2
myPlotPCA <- function (x, intgroup = "condition", pc1 = 1, pc2 =2, ntop = 500, returnData = FALSE) 
{
  rv <- genefilter::rowVars(assay(x)) #if using with SummarizedExperiment objects
  select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, 
                                                     length(rv)))]
  pca <- prcomp(t(assay(x)[select, ]))
  percentVar <- pca$sdev^2/sum(pca$sdev^2)
  if (!all(intgroup %in% names(colData(x)))) {
     stop("the argument 'intgroup' should specify columns of colData(dds)")
  }
  intgroup.df <- as.data.frame(colData(x)[, intgroup, drop = FALSE])
  group <- if (length(intgroup) > 1) {
            factor(apply(intgroup.df, 1, paste, collapse = " : "))
        }
        else {
            colData(x)[[intgroup]]
        }
  
  d <- data.frame(PC1 = pca$x[, pc1], PC2 = pca$x[, pc2], group = group, 
                  intgroup.df, names = colnames(x))
  if (returnData) {
    attr(d, "percentVar") <- percentVar[c(pc1,pc2)]
    return(d)
  }
  ggplot(data = d, 
         aes_string(x = "PC1", y = "PC2",  color = "group"))+
    geom_point(size=3) +
    xlab(paste0("PC",pc1,": ", round(percentVar[pc1] * 100), "% variance")) +
    ylab(paste0("PC",pc2,": ", round(percentVar[pc2] * 100), "% variance"))
}

Using rtracklayer and the gtf file to obtain gene names

gtf <- rtracklayer::import('genomes/Caenorhabditis_elegans.WBcel235.99.gtf')
gtf_df=as.data.frame(gtf)
t2g <- data.frame(target_id = gtf_df$transcript_id,
                  ens_gene = gtf_df$gene_id,
                  ext_gene = gtf_df$gene_name)
t2g <- t2g %>% drop_na()
t2g <- dplyr::distinct(t2g)

The gene counts are in the .tab files of STAR. I moved those locally and use code from: http://biostars.org/p/241602 to read in counts so they can be used with DESeq2.

library(ggplot2)
library(DESeq2)
## Loading required package: SummarizedExperiment
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
## 
##     count
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum
#First, read in the gene counts from STAR
number<- 4 #reverse-stranded library so counts should be in column 4
ff <- list.files( path = "./star_mapping", pattern = "*ReadsPerGene.out.tab$", full.names = TRUE )
counts.files <- lapply( ff, read.table, skip = 4 )
starcounts <- as.data.frame( sapply( counts.files, function(x) x[ , number ] ) )
ff <- gsub( "[_]ReadsPerGene[.]out[.]tab", "", ff )
ff <- gsub( "[.]/star_mapping/", "", ff )
colnames(starcounts) <- ff
row.names(starcounts) <- counts.files[[1]]$V1
rm(counts.files)

coldata <- read.table("metadata.txt", header=T)
rownames(coldata) <- coldata[,1]
coldata
##        sample condition
## N1_S5   N1_S5   control
## N2_S6   N2_S6   control
## N3_S7   N3_S7   control
## N4_S8   N4_S8   control
## SG1_S1 SG1_S1  neurexin
## SG2_S2 SG2_S2  neurexin
## SG3_S3 SG3_S3  neurexin
## SG4_S4 SG4_S4  neurexin
#Check row and column names match in the same order
rownames(coldata)  == colnames(starcounts)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
dds = DESeqDataSetFromMatrix(countData = starcounts, 
                             colData = coldata, 
                             design = ~ condition )
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet 
## dim: 46904 8 
## metadata(1): version
## assays(1): counts
## rownames(46904): WBGene00197333 WBGene00198386 ... WBGene00010967
##   WBGene00014473
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#pre-filter to remove genes with too few counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
## class: DESeqDataSet 
## dim: 14163 8 
## metadata(1): version
## assays(1): counts
## rownames(14163): WBGene00015153 WBGene00002061 ... WBGene00010966
##   WBGene00010967
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#check quickly the distribution of the other counts
quantile(rowSums(counts(dds)))
##       0%      25%      50%      75%     100% 
##       10       51      364     2168 83054802
#and in more detail
quantile(rowSums(counts(dds)), probs=seq(0.9,1,0.005))
##         90%       90.5%         91%       91.5%         92%       92.5% 
##     5149.00     5294.22     5505.68     5756.92     6024.16     6250.85 
##         93%       93.5%         94%       94.5%         95%       95.5% 
##     6603.22     6954.88     7311.76     7756.54     8130.50     8767.13 
##         96%       96.5%         97%       97.5%         98%       98.5% 
##     9355.20    10231.64    11092.54    12633.10    14999.88    18921.49 
##         99%       99.5%        100% 
##    26026.46    38536.55 83054802.00
vsd <- vst(dds, blind=TRUE)

plotPCA(vsd, intgroup=c("condition"))

This initial PCA looks promising with good separation of the samples. However, the rRNAs remain in the counts as shown above and they could be skewing the results.

We need to remove all counts corresponding to rRNA transcript_biotype reads (there are 23) - the genome/transcriptome files are from Ensembl and can be found in this directory: /d/in16/u/ubcg71a/research/genomes/c.elegans

 zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep biotype:rRNA > list_of_biotype_rRNA_transcripts.txt
 
#keep gene names only
awk '{print $4}' list_of_biotype_rRNA_transcripts.txt | sed 's/gene://' | sed 's/\..//' > list_biotype_rRNA_genes.txt

We will now move this list of rRNA genes locally and use it to filter out the unwanted counts.

library("pheatmap")

#clean up
rm(ff, coldata, starcounts, dds, vsd)

number<- 4 #reverse-stranded library so counts should be in column 4
ff <- list.files( path = "./star_mapping", pattern = "*ReadsPerGene.out.tab$", full.names = TRUE )
counts.files <- lapply( ff, read.table, skip = 4 )
starcounts <- as.data.frame( sapply( counts.files, function(x) x[ , number ] ) )
ff <- gsub( "[_]ReadsPerGene[.]out[.]tab", "", ff )
ff <- gsub( "[.]/star_mapping/", "", ff )
colnames(starcounts) <- ff
row.names(starcounts) <- counts.files[[1]]$V1
rm(counts.files)

coldata <- read.table("metadata.txt", header=T)
rownames(coldata) <- coldata[,1]
coldata
##        sample condition
## N1_S5   N1_S5   control
## N2_S6   N2_S6   control
## N3_S7   N3_S7   control
## N4_S8   N4_S8   control
## SG1_S1 SG1_S1  neurexin
## SG2_S2 SG2_S2  neurexin
## SG3_S3 SG3_S3  neurexin
## SG4_S4 SG4_S4  neurexin
#Check row and column names match in the same order
rownames(coldata)  == colnames(starcounts)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#now remove the rRNAs from the counts table
rRNAgenes <- read.table(file="list_biotype_rRNA_genes.txt", header=F)
rRNAgenes
##                V1
## 1  WBGene00004677
## 2  WBGene00004512
## 3  WBGene00004513
## 4  WBGene00004567
## 5  WBGene00004622
## 6  WBGene00014454
## 7  WBGene00014472
## 8  WBGene00014621
## 9  WBGene00077465
## 10 WBGene00077466
## 11 WBGene00077467
## 12 WBGene00077468
## 13 WBGene00077469
## 14 WBGene00077470
## 15 WBGene00077471
## 16 WBGene00077472
## 17 WBGene00077473
## 18 WBGene00077474
## 19 WBGene00077475
## 20 WBGene00077476
## 21 WBGene00077477
## 22 WBGene00189966
## 23 WBGene00235197
starcounts.norRNA <- starcounts[!row.names(starcounts)%in%rRNAgenes[,1],]
dim(starcounts)
## [1] 46904     8
dim(starcounts.norRNA)
## [1] 46881     8
dim(rRNAgenes)
## [1] 23  1
rm(starcounts) #clean up
dds = DESeqDataSetFromMatrix(countData = starcounts.norRNA, 
                             colData = coldata, 
                             design = ~ condition )
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet 
## dim: 46881 8 
## metadata(1): version
## assays(1): counts
## rownames(46881): WBGene00197333 WBGene00198386 ... WBGene00010967
##   WBGene00014473
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#pre-filter to remove genes with too few counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
## class: DESeqDataSet 
## dim: 14156 8 
## metadata(1): version
## assays(1): counts
## rownames(14156): WBGene00015153 WBGene00002061 ... WBGene00010966
##   WBGene00010967
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#check quickly the distribution of the other counts
quantile(rowSums(counts(dds)))
##        0%       25%       50%       75%      100% 
##     10.00     51.00    363.00   2162.25 240054.00
#and in more detail
quantile(rowSums(counts(dds)), probs=seq(0.9,1,0.005))
##        90%      90.5%        91%      91.5%        92%      92.5%        93% 
##   5137.500   5279.275   5485.100   5740.000   6012.600   6230.750   6569.650 
##      93.5%        94%      94.5%        95%      95.5%        96%      96.5% 
##   6916.775   7291.800   7736.800   8108.000   8734.175   9302.200  10166.250 
##        97%      97.5%        98%      98.5%        99%      99.5%       100% 
##  11065.100  12575.500  14494.600  18757.400  25663.900  37461.975 240054.000
#show the genes that still have very large counts
i<- rowSums(counts(dds)) >= 100000
assay(dds)[i,]
##                N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## WBGene00000915 13312 12611 10603 15602  12740  10132  12254  14820
## WBGene00006929 18386 19388 18132 20877  18804  15398  18484  23763
## WBGene00006926 27452 26867 26905 28783  20753  17046  19889  25582
## WBGene00006930 29295 29066 28874 31287  30451  24642  29171  37268
## WBGene00001167 11896 10835 10303 14872  18855  12461  18973  21002
## WBGene00001168 21195 20607 20916 26436  23787  17016  25938  33677
#Check correlation between raw gene expression vectors for all samples (without log transformation)
pairs(assay(dds))

#and following log transformation
pairs(log(assay(dds)+0.5)) #add 0.5 as pseudocount

cor.mat <- cor(assay(dds))
cor.mat
##            N1_S5     N2_S6     N3_S7     N4_S8    SG1_S1    SG2_S2    SG3_S3
## N1_S5  1.0000000 0.9876924 0.9951246 0.9802646 0.9524706 0.9692726 0.9512352
## N2_S6  0.9876924 1.0000000 0.9900583 0.9869876 0.9472874 0.9698440 0.9392737
## N3_S7  0.9951246 0.9900583 1.0000000 0.9796024 0.9443960 0.9618507 0.9429706
## N4_S8  0.9802646 0.9869876 0.9796024 1.0000000 0.9705380 0.9718217 0.9649730
## SG1_S1 0.9524706 0.9472874 0.9443960 0.9705380 1.0000000 0.9881489 0.9968161
## SG2_S2 0.9692726 0.9698440 0.9618507 0.9718217 0.9881489 1.0000000 0.9817127
## SG3_S3 0.9512352 0.9392737 0.9429706 0.9649730 0.9968161 0.9817127 1.0000000
## SG4_S4 0.9579343 0.9485628 0.9532682 0.9725760 0.9954057 0.9829288 0.9972798
##           SG4_S4
## N1_S5  0.9579343
## N2_S6  0.9485628
## N3_S7  0.9532682
## N4_S8  0.9725760
## SG1_S1 0.9954057
## SG2_S2 0.9829288
## SG3_S3 0.9972798
## SG4_S4 1.0000000
pheatmap(cor.mat, legend=T)

#Transform the counts 
vsd <- vst(dds, blind=TRUE)
plotPCA(vsd, intgroup=c("condition"))

#As expected the PCA is not changed much because we only removed a few genes

#try rlog
rld <- rlog(dds, blind = TRUE)
plotPCA(rld, intgroup=c("condition"))

#we can improve the plot a bit by highlighting the replicate number in different shape
pcaData <- plotPCA(vsd, intgroup=c("condition", "sample"), returnData=TRUE)
pcaData$replicate <- sub(pattern= ".*(.)_.*", 
                         replacement="\\1", 
                         perl=TRUE, fixed=FALSE, 
                         x=pcaData$sample)
pcaData
##               PC1        PC2           group condition sample   name replicate
## N1_S5  -10.515292  3.9809806   control:N1_S5   control  N1_S5  N1_S5         1
## N2_S6  -11.343047 -4.3937220   control:N2_S6   control  N2_S6  N2_S6         2
## N3_S7  -10.735343  3.8215511   control:N3_S7   control  N3_S7  N3_S7         3
## N4_S8   -8.883619 -3.2672766   control:N4_S8   control  N4_S8  N4_S8         4
## SG1_S1  10.795760 -0.7681986 neurexin:SG1_S1  neurexin SG1_S1 SG1_S1         1
## SG2_S2   8.886207 -2.7062952 neurexin:SG2_S2  neurexin SG2_S2 SG2_S2         2
## SG3_S3  11.160175  1.7500821 neurexin:SG3_S3  neurexin SG3_S3 SG3_S3         3
## SG4_S4  10.635157  1.5828787 neurexin:SG4_S4  neurexin SG4_S4 SG4_S4         4
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=replicate)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()

pcaData <- myPlotPCA(vsd, pc1= 2, pc2= 3, intgroup=c("condition", "sample"), returnData=TRUE)
pcaData$replicate <- sub(pattern= ".*(.)_.*", 
                         replacement="\\1", 
                         perl=TRUE, fixed=FALSE, 
                         x=pcaData$sample)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=replicate)) +
geom_point(size=3) +
xlab(paste0("PC2: ",percentVar[1],"% variance")) +
ylab(paste0("PC3: ",percentVar[2],"% variance")) +
coord_fixed()

#Check correlation between transformed gene expression vectors for all samples
pairs(assay(vsd))

cor.mat <- cor(assay(vsd))
cor.mat
##            N1_S5     N2_S6     N3_S7     N4_S8    SG1_S1    SG2_S2    SG3_S3
## N1_S5  1.0000000 0.9897970 0.9926442 0.9895144 0.9778167 0.9812096 0.9779625
## N2_S6  0.9897970 1.0000000 0.9904951 0.9911871 0.9752235 0.9819031 0.9726985
## N3_S7  0.9926442 0.9904951 1.0000000 0.9884828 0.9758348 0.9803706 0.9760524
## N4_S8  0.9895144 0.9911871 0.9884828 1.0000000 0.9836709 0.9840315 0.9817861
## SG1_S1 0.9778167 0.9752235 0.9758348 0.9836709 1.0000000 0.9919673 0.9944142
## SG2_S2 0.9812096 0.9819031 0.9803706 0.9840315 0.9919673 1.0000000 0.9898898
## SG3_S3 0.9779625 0.9726985 0.9760524 0.9817861 0.9944142 0.9898898 1.0000000
## SG4_S4 0.9789728 0.9741121 0.9774286 0.9830008 0.9947647 0.9907127 0.9951929
##           SG4_S4
## N1_S5  0.9789728
## N2_S6  0.9741121
## N3_S7  0.9774286
## N4_S8  0.9830008
## SG1_S1 0.9947647
## SG2_S2 0.9907127
## SG3_S3 0.9951929
## SG4_S4 1.0000000
pheatmap(cor.mat, legend=T)

Next, we carry out differential gene expression based on the gene counts from STAR mapping. Note that this approach is now pretty much deprecated and replaced by transcript-based analysis summarised at the gene level so we won't really analyse the results here but add them here for completion.

#BiocManager::install("apeglm")
library(apeglm) #to use in the lfcShrink function

dds <- DESeq(dds) #estimates size factors and dispersions and fits the model
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
res
## log2 fold change (MLE): condition neurexin vs control 
## Wald test p-value: condition neurexin vs control 
## DataFrame with 14156 rows and 6 columns
##                        baseMean      log2FoldChange             lfcSE
##                       <numeric>           <numeric>         <numeric>
## WBGene00015153 1.40026331133511    1.23369435011502  1.25556206618441
## WBGene00002061 1023.39358270591 -0.0561810129305647 0.111894106778005
## WBGene00001177 236.878096958743   0.645083433822109 0.130496798969703
## WBGene00021412 24.8706101313319   0.267994653237935 0.287508181787133
## WBGene00018976 6.03204079332923  0.0830540521723233 0.559216117300581
## ...                         ...                 ...               ...
## WBGene00010963 646.235482854105  0.0560344400948453 0.532154321210568
## WBGene00010964 6610.73359986534  -0.301756164128778 0.339749774394959
## WBGene00010965 698.469898642392 -0.0714973118809531 0.592928262802426
## WBGene00010966 54.8204097454485   0.488445620632377 0.207379835476911
## WBGene00010967 2519.35135960664   0.508179713782424 0.122171381252483
##                              stat               pvalue                 padj
##                         <numeric>            <numeric>            <numeric>
## WBGene00015153   0.98258332530239    0.325812554464328                   NA
## WBGene00002061 -0.502090901373623    0.615603580096765    0.755603864950982
## WBGene00001177   4.94328932904997 7.68153514674162e-07 1.49941766583249e-05
## WBGene00021412   0.93212878872558     0.35126997168092    0.521797355147636
## WBGene00018976  0.148518702524594    0.881933427153093    0.933540011973405
## ...                           ...                  ...                  ...
## WBGene00010963  0.105297350526771    0.916139865263089    0.954440625434673
## WBGene00010964 -0.888171786621956    0.374448352556327    0.544673615403436
## WBGene00010965 -0.120583410112763    0.904021009760152    0.946326013043526
## WBGene00010966   2.35531877778328   0.0185068216532182   0.0590552767545967
## WBGene00010967    4.1595642823437 3.18855275260321e-05 0.000347870583450949
#next we want to shrink the LFC to help visualisation and ranking of genes
resultsNames(dds)
## [1] "Intercept"                     "condition_neurexin_vs_control"
resLFC <- lfcShrink(dds, coef="condition_neurexin_vs_control", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
## log2 fold change (MAP): condition neurexin vs control 
## Wald test p-value: condition neurexin vs control 
## DataFrame with 14156 rows and 5 columns
##                        baseMean      log2FoldChange             lfcSE
##                       <numeric>           <numeric>         <numeric>
## WBGene00015153 1.40026331133511  0.0865693105221461 0.334873921280774
## WBGene00002061 1023.39358270591 -0.0501595379594478 0.106187975503463
## WBGene00001177 236.878096958743     0.6101360636599 0.131419234864905
## WBGene00021412 24.8706101313319    0.16047086366021 0.231989548013496
## WBGene00018976 6.03204079332923  0.0219340702385293 0.285509651993327
## ...                         ...                 ...               ...
## WBGene00010963 646.235482854105  0.0165071893007835 0.281593365688285
## WBGene00010964 6610.73359986534   -0.15473391499059  0.25612671632153
## WBGene00010965 698.469898642392 -0.0161784771837534 0.289572855203541
## WBGene00010966 54.8204097454485   0.398418895186059 0.203231809685816
## WBGene00010967 2519.35135960664   0.476542635901622 0.122242317594528
##                              pvalue                 padj
##                           <numeric>            <numeric>
## WBGene00015153    0.325812554464328                   NA
## WBGene00002061    0.615603580096765    0.755603864950982
## WBGene00001177 7.68153514674162e-07 1.49941766583249e-05
## WBGene00021412     0.35126997168092    0.521797355147636
## WBGene00018976    0.881933427153093    0.933540011973405
## ...                             ...                  ...
## WBGene00010963    0.916139865263089    0.954440625434673
## WBGene00010964    0.374448352556327    0.544673615403436
## WBGene00010965    0.904021009760152    0.946326013043526
## WBGene00010966   0.0185068216532182   0.0590552767545967
## WBGene00010967 3.18855275260321e-05 0.000347870583450949
#MA plot
plotMA(resLFC, ylim=c(-2,2))

#contrast with the same plot but for the unshrunken log fold changes
plotMA(res, ylim=c(-2,2))

#plot top few genes
resLFCOrdered <- resLFC[order(resLFC$pvalue),]
resLFCOrdered[1:10,]
## log2 fold change (MAP): condition neurexin vs control 
## Wald test p-value: condition neurexin vs control 
## DataFrame with 10 rows and 5 columns
##                        baseMean    log2FoldChange             lfcSE
##                       <numeric>         <numeric>         <numeric>
## WBGene00008862  222.57019613529  5.57679743055826 0.289049887335795
## WBGene00006541 816.161213344128 -1.93485791500237 0.102347519609402
## WBGene00016627 533.260899008354 -2.19269946224634 0.118338343000403
## WBGene00000657 369.382836201042  2.89687520175153 0.167550806546769
## WBGene00000712 519.893685556946  2.72103817505864 0.171840363835382
## WBGene00020218 263.048181538241 -2.09074236696147 0.132190159983885
## WBGene00016453 332.089024007193  -1.9220886364233 0.124163319245512
## WBGene00000703 450.872355408933  2.65300765292485 0.174067247594477
## WBGene00018393 711.229164789001 -1.68439569805028 0.117303671006414
## WBGene00022644 337.023507140299  1.58598181801307 0.115549105174495
##                              pvalue                 padj
##                           <numeric>            <numeric>
## WBGene00008862 3.35503734507776e-83 4.47293578845766e-79
## WBGene00006541 6.37282902848922e-81 4.24812783039092e-77
## WBGene00016627 6.63788957158305e-78 2.94987812561151e-74
## WBGene00000657 2.60644403051886e-68 8.68727795371936e-65
## WBGene00000712 7.67188409824923e-58 2.04563117595718e-54
## WBGene00020218 1.40135503067076e-57 3.11381087815043e-54
## WBGene00016453 2.85659809132675e-55 5.44059510765261e-52
## WBGene00000703 7.94087824248563e-54 1.32334735911023e-50
## WBGene00018393 6.21843170000284e-48 9.21157015827087e-45
## WBGene00022644 5.18169923397144e-44 6.90824141873072e-41
rm(resLFC)
for (i in 1:10) {
 plotCounts(dds, gene=rownames(resLFCOrdered)[i], intgroup="condition")
}

Further exploration of the data using the pcaExplorer package in R (just trying out - DO NOT INCLUDE)

#devtools::install_github("rstudio/d3heatmap") #doesn't install otherwise
#BiocManager::install("pcaExplorer")

library(pcaExplorer)

#need a mapping from Ensembl ids to gene names; will use a t2g subset; row names must map gene ids
t2g.sub <- t2g[,c("ens_gene", "ext_gene")]
t2g.sub<- dplyr::distinct(t2g.sub)
rownames(t2g.sub) <- t2g.sub[,"ens_gene"]
colnames(t2g.sub) <- c("ens_gene", "gene_name")

#pcaExplorer(dds = dds, dst = vsd, annotation= t2g.sub)

#Conclusion: Good for quick exploration but the output downloaded plot for loadings did not agree with the one shown online so I'm not sure how buggy it is....(on another trial it worked...)
#Will explore loadings outside this...

Alt text

Further exploration of the data using the PCAtools package from Bioconductor

#BiocManager::install('PCAtools')

library(PCAtools)
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Loading required package: lattice
## Loading required package: cowplot
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
## 
## Attaching package: 'PCAtools'
## The following objects are masked from 'package:stats':
## 
##     biplot, screeplot
p<- pca(assay(vsd), metadata=coldata)

screeplot(p)
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).

biplot(p, colby="condition")

pairsplot(p, colby= "condition")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

plotloadings(p)
## -- variables retained:
## WBGene00008862, WBGene00020218, WBGene00006541, WBGene00016627, WBGene00021468, WBGene00000716, WBGene00000656, WBGene00195017, WBGene00195016, WBGene00010314, WBGene00271656, WBGene00010965, WBGene00045168, WBGene00077652, WBGene00016534, WBGene00013699, WBGene00016457, WBGene00010215, WBGene00001789, WBGene00044922, WBGene00017661, WBGene00009523, WBGene00011345, WBGene00011974, WBGene00022240, WBGene00303364, WBGene00009129

loadings(p)[which.max(abs(loadings(p)$PC1)),]
##                       PC1          PC2        PC3         PC4         PC5
## WBGene00008862 0.09067169 -0.008180751 0.03666657 0.003266629 -0.03149979
##                         PC6          PC7         PC8
## WBGene00008862 -0.000739828 0.0002908326 0.002464358
loadings(p)[which.max(abs(loadings(p)$PC2)),]
##                          PC1         PC2         PC3          PC4           PC5
## WBGene00010965 -0.0002111257 -0.07537175 -0.00877533 -0.006426331 -0.0006274245
##                        PC6          PC7          PC8
## WBGene00010965 0.004823226 -0.005803499 -0.009104016

Observations from PCA loadings plot:

  • Largest contribution to PC1 is from gene WBGene00008862 (F15D4.5). There is very little annotation for this gene, except that it plays some regulatory role and interacts with deps-1, a gene localized to P granules. HHblits hits almost exclusively uncharacterised proteins but the C-terminal seems to map to CCHC-type domains (which are zinc fingers). It also hits some Pro-Pol polyproteins across a much larger part of the protein sequence (again at the C-terminus)
  • Large negative loadings for PC1: WBGene00016627 (C44B7.5), an orthologue of human ferric chelate reductase 1. Human orthologues of this gene are implicated in early infantile epileptic encephalopathy 37. WBGene00021468 (epg-2), involved in macroautophagy and negative regulation of autophagosome assmebly.
  • Largest (negative) loading for PC2: WBGene00010965 (ctc-2), cytochrome c oxidase activity, affected by daf-2

Differential expression analysis based on isoform quantification from kallisto

Dealing with ribosomal RNAs

Aine's work showed a huge number of reads mapping to rRNA. I've decided it's probably best to remove them BEFORE proceeding with anything else so we don't have to worry about them later. I saw that many people use bwa or bowtie2 for this but rRNAs of eukaryotes can also have introns so would need a spliced aligner. Other solutions; BBsplit from BBmap can split reads from different references but does it based on k-mers. SortMeRNA also removes reads that map to a set of sequences (used mostly with prokaryotes and metagenomes..would need to check if it works well with eukaryotes).

I decided in the end that the easiest solution might be to rebuild the index that kallisto uses and re-run kallisto so that the rRNAs are not part of the reference transcriptome, allowing the reads to remain unmapped (hopefully!).

We need to remove all sequences with rRNA transcript_biotype reads (there are 23) - the genome/transcriptome files are from Ensembl and can be found in this directory: /d/in16/u/ubcg71a/research/genomes/c.elegans

 zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep biotype:rRNA > list_of_biotype_rRNA_transcripts.txt

Found a one-liner online to do this easily (basically, stitches the lines together separated by "", greps and then separates again with newlines:

zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' | grep -vFf pattern.txt - | tr "\t" "\n" > Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa

Checking:

grep ">" Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa | wc
61428
zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep ">" | wc
61451

wc list_of_biotype_rRNA_transcripts.txt
23

Then zip the final .fa file:

gzip -9 Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa

Indexing the transcriptome with kallisto

(all transcriptome files are under: /d/in16/u/ubcg71a/research/genomes/c.elegans See README for how they were obtained):

/d/in7/s/kallisto/kallisto_linux-v0.46.1/kallisto index -i Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.index Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa.gz

Pseudo-mapping with kallisto

This blog: http://genomespot.blogspot.com/2015/08/how-accurate-is-kallisto.html suggests that the difference between trimming before kallisto or not trimming does not make a huge difference overall.

So I did not run Trimmomatic on top of fastp as I would have done if we were mapping with STAR or similar.

Running first without pseudoalignment (which takes a lot more time and space) but with bootstrapping (this is run in the directory $PROJECT/kallisto_with_ncRNA_norRNA/):

nohup ./run_kallisto.sh ../fastp_merged/ N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4 >& run_kallisto.out &

I got initially rather low mapping rates (about 20%) and was worried I had the orientation of the reads wrong. I reversed the orientation and that gave less than 1% alignment so that was clearly not the problem. In this forum: https://github.com/pachterlab/kallisto/issues/198 there is a suggestion that the culprit is ncRNAs that are not included in the transcriptome file, which would make sense since we know that 80% of the reads are to rRNA in this case.

I re-did the run with a transcriptome that had ncRNA too. Now, alignment jumpted to 98.2% so the ncRNA missing was definitely the problem here (rRNA really..) These results were under kallisto_with_ncRNA, a directory that I eventually deleted to save space. The results without ncRNA are now renamed to kallisto_cDNA NOTE: the abundance estimates for the bootstrap are not written by default to the csv file but they can be written out using:

/d/in7/s/kallisto/kallisto_linux-v0.46.1/kallisto h5dump -o output abundance.h5

(where output is a directory where the results will be saved)

I will probably keep both versions and run sleuth with both to see whether there are any differences that affect the coding genes too.

Finally, I re-did the run with a transcriptome that had ncRNA but no rRNA. I then deleted the previous run that had ncRNA, including rRNAs to save space. So the kallisto_with_ncRNA directory has now been replaced with: kallisto_with_ncRNA_norRNA Alignment rates: p_unique: 9-10%, p_pseudoaligned: ~20%

I also did one re-run with ncRNA and no bootstrap but asking for pseudo-alignment to the genome so that we could get pseudo-bam files for visualisation: In directory: /d/in16/u/ubcg71a/research/filipe/kallisto_with_ncRNA_norRNA/pseudo_bam/

nohup ./run_kallisto_pseudobam.sh ../../fastp_merged/ N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4 >& run_kallisto.out &

Gene-level differential expression using kallisto counts and sleuth

This approach is following the code from: https://github.com/pachterlab/aggregationDE/blob/master/R/gene_pipeline and the procedure of aggregating p-values after transcript differential expression has been carried out to call differentially expressed genes (see paper by Yi et al: doi: 10.1186/s13059-018-1419-z).

Setup directories and variables

'%!in%' <- function(x,y)!('%in%'(x,y))

setwd("~/Dropbox/work/research/Filipe_neurexin/NEW_PROCESS/")

#sample_id <- dir(file.path(".", "kallisto_cdna"))
sample_id <- dir(file.path(".", "kallisto_with_ncRNA_norRNA"))
sample_id
## [1] "N1_S5"  "N2_S6"  "N3_S7"  "N4_S8"  "SG1_S1" "SG2_S2" "SG3_S3" "SG4_S4"
#kal_dirs <- file.path(".", "kallisto_cdna", sample_id)
kal_dirs <- file.path(".", "kallisto_with_ncRNA_norRNA", sample_id)

Read in metadata

Sleuth requires metadata information corresponding to the samples in the dataset. In this case, the metadata is very simple and was already created outside the R script. We simply need to add to this table the directories where kallisto results are saved.

s2c <- read.table("./metadata.txt",  header = TRUE, stringsAsFactors=FALSE)
s2c
##   sample condition
## 1  N1_S5   control
## 2  N2_S6   control
## 3  N3_S7   control
## 4  N4_S8   control
## 5 SG1_S1  neurexin
## 6 SG2_S2  neurexin
## 7 SG3_S3  neurexin
## 8 SG4_S4  neurexin
s2c <- dplyr::mutate(s2c, path = kal_dirs)
s2c
##   sample condition                                path
## 1  N1_S5   control  ./kallisto_with_ncRNA_norRNA/N1_S5
## 2  N2_S6   control  ./kallisto_with_ncRNA_norRNA/N2_S6
## 3  N3_S7   control  ./kallisto_with_ncRNA_norRNA/N3_S7
## 4  N4_S8   control  ./kallisto_with_ncRNA_norRNA/N4_S8
## 5 SG1_S1  neurexin ./kallisto_with_ncRNA_norRNA/SG1_S1
## 6 SG2_S2  neurexin ./kallisto_with_ncRNA_norRNA/SG2_S2
## 7 SG3_S3  neurexin ./kallisto_with_ncRNA_norRNA/SG3_S3
## 8 SG4_S4  neurexin ./kallisto_with_ncRNA_norRNA/SG4_S4

Using biomaRt to map transcript ids to gene names and descriptions

We will get gene names from biomaRt to add to the transcript IDs we have from sleuth

ens <- useMart("ensembl")
attr = c("ensembl_gene_id", 
         "ensembl_transcript_id",
         "description",
         "external_gene_name",
         "entrezgene_id")
celeg <- useMart("ensembl",
                   dataset = "celegans_gene_ensembl")

t2g <- getBM(attributes = attr,
              mart = celeg)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Cache found
t2g <- dplyr::rename(t2g, 
                     target_id = ensembl_transcript_id,
                     ens_gene = ensembl_gene_id, 
                     ext_gene = external_gene_name)

Differential expression with sleuth

#### GENE-LEVEL-ANALYSIS ####
#Following code from: https://github.com/pachterlab/aggregationDE/blob/master/R/gene_pipeline.R
design <- ~ condition
so <- sleuth_prep(s2c,
                  full_model = design,
                  read_bootstrap_tpm = TRUE,
                  extra_bootstrap_summary=TRUE,
                  target_mapping = t2g,
                  transformation_function = function(x) log2(x + 0.5), #change the natural log fold change to log2 fold change
                  aggregation_column = 'ens_gene'
                  #filter_target_id=txfilter
                  )
## Warning in check_num_cores(num_cores): It appears that you are running Sleuth from within Rstudio.
## Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
## If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
## reading in kallisto results
## dropping unused factor levels
## ........
## normalizing est_counts
## 15845 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## ........
so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## 4 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
## The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
## These are the target ids with NA values: T23E7.4a.2, C04F6.1.1, C42D8.2a.1, K07H8.6a.1
## computing variance of betas
#do a Wald test first
so <- sleuth_wt(so, which_beta='conditionneurexin', which_model='full')
sleuth_table_wt <- sleuth_results(so, 
                                  test= 'conditionneurexin', 
                                  test_type =  'wt',
                                  which_model = 'full')
sleuth_table_wt[1:5,]
##        target_id                                                 description
## 1 WBGene00016627                                                            
## 2 WBGene00008862                                                            
## 3 WBGene00020218                                                            
## 4 WBGene00016453 Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5 WBGene00000657              COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
##   ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1  C44B7.5        174124                          1            6.017851
## 2  F15D4.5        174969                          1            4.148146
## 3  T04G9.7        180424                          1            5.336970
## 4    vet-2        172993                          1            5.606577
## 5   col-81        174686                          1            5.464653
##           pval         qval
## 1 7.057950e-85 8.001598e-81
## 2 1.613360e-81 9.145333e-78
## 3 1.576815e-60 5.958785e-57
## 4 1.408001e-49 3.990628e-46
## 5 3.739522e-46 8.478992e-43
sleuth_significant_wt <- dplyr::filter(sleuth_table_wt, qval <= 0.05)
dim(sleuth_significant_wt)
## [1] 2339    8
head(sleuth_significant_wt, 20)
##         target_id
## 1  WBGene00016627
## 2  WBGene00008862
## 3  WBGene00020218
## 4  WBGene00016453
## 5  WBGene00000657
## 6  WBGene00022644
## 7  WBGene00012557
## 8  WBGene00003977
## 9  WBGene00010158
## 10 WBGene00010808
## 11 WBGene00007196
## 12 WBGene00018605
## 13 WBGene00017648
## 14 WBGene00000301
## 15 WBGene00018138
## 16 WBGene00009897
## 17 WBGene00018393
## 18 WBGene00021005
## 19 WBGene00011432
## 20 WBGene00000703
##                                                                             description
## 1                                                                                      
## 2                                                                                      
## 3                                                                                      
## 4                           Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6  Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7                                                                                      
## 8        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 10                                                                                     
## 11                                                                                     
## 12                                                                                     
## 13                      D-aspartate oxidase 3  [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 14                                 Caveolin-1  [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 15                  Folate-like transporter 2  [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 16                          Peroxidase skpo-1  [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 17               Methionine Sulfoxide Reductase A  [Source:UniProtKB/TrEMBL;Acc:O02089]
## 18             Uterine Lumin Expressed/locailized  [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 19             SKN-1 Dependent Zygotic transcript  [Source:UniProtKB/TrEMBL;Acc:O02297]
## 20                                       COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q21556]
##     ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1    C44B7.5        174124                          1            6.017851
## 2    F15D4.5        174969                          1            4.148146
## 3    T04G9.7        180424                          1            5.336970
## 4      vet-2        172993                          1            5.606577
## 5     col-81        174686                          1            5.464653
## 6     dod-19        178564                          1            5.680775
## 7  Y37D8A.19        176711                          1            6.422207
## 8    pes-2.1        173025                          1            4.532240
## 9    pes-2.2        173026                          1            4.532240
## 10    sepa-1        173196                          1            4.525542
## 11   B0513.4        178351                          1            5.248247
## 12   F48E3.4        181006                          1            6.154206
## 13     ddo-3        184746                          1            5.835642
## 14     cav-1        177815                          1            5.226113
## 15    folt-2        178745                          1            7.013395
## 16    skpo-1        174340                          1            6.620497
## 17    msra-1        185709                          2           11.335050
## 18     ule-1        171778                          1            6.022399
## 19    sdz-30        173198                          1            4.018878
## 20   col-129        178155                          1            5.713564
##            pval         qval
## 1  7.057950e-85 8.001598e-81
## 2  1.613360e-81 9.145333e-78
## 3  1.576815e-60 5.958785e-57
## 4  1.408001e-49 3.990628e-46
## 5  3.739522e-46 8.478992e-43
## 6  6.131492e-41 1.158545e-37
## 7  3.821684e-39 6.189491e-36
## 8  4.000704e-38 5.039554e-35
## 9  4.000704e-38 5.039554e-35
## 10 8.901306e-38 1.009141e-34
## 11 1.780045e-37 1.834579e-34
## 12 1.672903e-36 1.580475e-33
## 13 6.915372e-36 6.030737e-33
## 14 1.116744e-33 9.043235e-31
## 15 3.081094e-32 2.328690e-29
## 16 5.041138e-31 3.571961e-28
## 17 3.562778e-30 2.375954e-27
## 18 2.273631e-29 1.432008e-26
## 19 2.532942e-29 1.511367e-26
## 20 5.864490e-29 3.324286e-26
#and then a LRT test
so <- sleuth_fit(so, ~1, 'null')
## fitting measurement error models
## shrinkage estimation
## 4 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
## The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
## These are the target ids with NA values: T23E7.4a.2, C04F6.1.1, C42D8.2a.1, K07H8.6a.1
## computing variance of betas
so <- sleuth_lrt(so, 'null', 'full')
sleuth_table_lrt <- sleuth_results(so, test='null:full', 'lrt')
sleuth_table_lrt[1:5,]
##        target_id
## 1 WBGene00008862
## 2 WBGene00018393
## 3 WBGene00016627
## 4 WBGene00020218
## 5 WBGene00006436
##                                                              description
## 1                                                                       
## 2 Methionine Sulfoxide Reductase A  [Source:UniProtKB/TrEMBL;Acc:O02089]
## 3                                                                       
## 4                                                                       
## 5                Titin homolog  [Source:UniProtKB/Swiss-Prot;Acc:G4SLH0]
##   ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1  F15D4.5        174969                          1            4.148146
## 2   msra-1        185709                          2           11.335050
## 3  C44B7.5        174124                          1            6.017851
## 4  T04G9.7        180424                          1            5.336970
## 5    ttn-1        266969                         15           51.353311
##           pval         qval
## 1 3.530000e-09 2.000981e-05
## 2 2.804708e-09 2.000981e-05
## 3 5.541369e-09 2.094083e-05
## 4 8.194098e-09 2.322412e-05
## 5 1.147580e-08 2.602022e-05
sleuth_significant_lrt <- dplyr::filter(sleuth_table_lrt, qval <= 0.05)
dim(sleuth_significant_lrt)
## [1] 2251    8
head(sleuth_significant_lrt, 20)
##         target_id
## 1  WBGene00008862
## 2  WBGene00018393
## 3  WBGene00016627
## 4  WBGene00020218
## 5  WBGene00006436
## 6  WBGene00016453
## 7  WBGene00013567
## 8  WBGene00022644
## 9  WBGene00009007
## 10 WBGene00009897
## 11 WBGene00018602
## 12 WBGene00000657
## 13 WBGene00007196
## 14 WBGene00017648
## 15 WBGene00010808
## 16 WBGene00000301
## 17 WBGene00003977
## 18 WBGene00010158
## 19 WBGene00010204
## 20 WBGene00010822
##                                                                             description
## 1                                                                                      
## 2                Methionine Sulfoxide Reductase A  [Source:UniProtKB/TrEMBL;Acc:O02089]
## 3                                                                                      
## 4                                                                                      
## 5                               Titin homolog  [Source:UniProtKB/Swiss-Prot;Acc:G4SLH0]
## 6                           Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 7                                                                                      
## 8  Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 9       OTUBain deubiquitylating protease homolog  [Source:UniProtKB/TrEMBL;Acc:Q19681]
## 10                          Peroxidase skpo-1  [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 11                                                                                     
## 12                                       COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 13                                                                                     
## 14                      D-aspartate oxidase 3  [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 15                                                                                     
## 16                                 Caveolin-1  [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 17       Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 18       Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 19                                                                                     
## 20                                                                                     
##     ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1    F15D4.5        174969                          1            4.148146
## 2     msra-1        185709                          2           11.335050
## 3    C44B7.5        174124                          1            6.017851
## 4    T04G9.7        180424                          1            5.336970
## 5      ttn-1        266969                         15           51.353311
## 6      vet-2        172993                          1            5.606577
## 7  Y75B12B.1        190716                          3           16.170090
## 8     dod-19        178564                          1            5.680775
## 9     otub-3        177679                          3           16.524906
## 10    skpo-1        174340                          1            6.620497
## 11   F48D6.4        180697                          3           10.502658
## 12    col-81        174686                          1            5.464653
## 13   B0513.4        178351                          1            5.248247
## 14     ddo-3        184746                          1            5.835642
## 15    sepa-1        173196                          1            4.525542
## 16     cav-1        177815                          1            5.226113
## 17   pes-2.1        173025                          1            4.532240
## 18   pes-2.2        173026                          1            4.532240
## 19   F57F5.1        179645                          1            7.583655
## 20  M01G12.9        187385                          1            3.599120
##            pval         qval
## 1  3.530000e-09 2.000981e-05
## 2  2.804708e-09 2.000981e-05
## 3  5.541369e-09 2.094083e-05
## 4  8.194098e-09 2.322412e-05
## 5  1.147580e-08 2.602022e-05
## 6  1.688221e-08 3.189893e-05
## 7  2.360899e-08 3.681739e-05
## 8  2.598034e-08 3.681739e-05
## 9  3.417326e-08 4.304692e-05
## 10 4.835407e-08 5.481901e-05
## 11 8.112164e-08 8.360691e-05
## 12 1.101088e-07 9.775006e-05
## 13 1.143796e-07 9.775006e-05
## 14 1.207110e-07 9.775006e-05
## 15 1.401189e-07 1.059018e-04
## 16 2.276656e-07 1.206595e-04
## 17 2.447886e-07 1.206595e-04
## 18 2.447886e-07 1.206595e-04
## 19 2.350286e-07 1.206595e-04
## 20 2.052485e-07 1.206595e-04
#Note: The Wald test usually returns more hits than the likelihood ratio test

#check counts for top significant genes in WT
selected_gene <- sleuth_significant_wt$target_id[1]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")

selected_gene <- sleuth_significant_wt$target_id[2]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")

selected_gene <- sleuth_significant_wt$target_id[3]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")

#save the gene differential expression results so they can be used with web servers for further analysis

#check the table has no duplicate entries
length(unique(sleuth_table_wt$target_id)) == length(sleuth_table_wt$target_id)
## [1] TRUE
write.table(sleuth_significant_wt$target_id, file="results/sleuth_significant_q005_NEW.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")

#avoid using this - pointless
write.table(dplyr::filter(sleuth_table_wt, qval > 0.05)$target_id, file="results/bg_genes_good.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")
#use this instead
write.table(dplyr::filter(sleuth_table_wt, !is.na(qval))$target_id, file="results/bg_genes_all.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")

##### TRANSCRIPT-LEVEL ANALYSIS #####
#To do the transcript-level analysis, simply set the pval_aggregate to FALSE
#NOTE that target_id will now be the transcript id rather than the gene id
#We will use the Wald test from here on
sleuth_table_txn <- sleuth_results(so, 
                                   test = 'conditionneurexin',
                                   test_type = 'wt',
                                   which_model = 'full',
                                   show_all = TRUE,
                                   pval_aggregate = FALSE)
sleuth_table_txn[1:5,]
##         ens_gene  target_id
## 1 WBGene00016627  C44B7.5.1
## 2 WBGene00008862  F15D4.5.1
## 3 WBGene00020218  T04G9.7.1
## 4 WBGene00016453 C35E7.1a.1
## 5 WBGene00000657  F38A3.1.1
##                                                   description ext_gene
## 1                                                              C44B7.5
## 2                                                              F15D4.5
## 3                                                              T04G9.7
## 4 Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]    vet-2
## 5              COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]   col-81
##   entrezgene_id         pval         qval         b       se_b mean_obs
## 1        174124 7.057950e-85 1.118332e-80 -1.517199 0.07771507 6.017851
## 2        174969 1.613360e-81 1.278185e-77  3.874707 0.20261631 4.148146
## 3        180424 1.576815e-60 8.328212e-57 -1.449539 0.08832342 5.336970
## 4        172993 1.408001e-49 5.577445e-46 -1.328045 0.08971666 5.606577
## 5        174686 3.739522e-46 1.185054e-42  2.030342 0.14235375 5.464653
##     var_obs    tech_var      sigma_sq smooth_sigma_sq final_sigma_sq
## 1 0.6664751 0.003078232  0.0071785109     0.009001033    0.009001033
## 2 4.3348560 0.059455577 -0.0065745514     0.022651161    0.022651161
## 3 0.6071586 0.005627500  0.0023363218     0.009974555    0.009974555
## 4 0.5104205 0.006637396  0.0009517097     0.009460763    0.009460763
## 5 1.2125366 0.006912328  0.0336168532     0.009669959    0.033616853
sleuth_significant_txn <- dplyr::filter(sleuth_table_txn, qval <= 0.05)
dim(sleuth_significant_txn)
## [1] 2263   15
head(sleuth_significant_txn, 20)
##          ens_gene   target_id
## 1  WBGene00016627   C44B7.5.1
## 2  WBGene00008862   F15D4.5.1
## 3  WBGene00020218   T04G9.7.1
## 4  WBGene00016453  C35E7.1a.1
## 5  WBGene00000657   F38A3.1.1
## 6  WBGene00022644   ZK6.10a.1
## 7  WBGene00012557 Y37D8A.19.1
## 8  WBGene00003977   F56G4.2.1
## 9  WBGene00010158   F56G4.3.1
## 10 WBGene00010808   M01E5.6.1
## 11 WBGene00007196  B0513.4a.1
## 12 WBGene00018605   F48E3.4.1
## 13 WBGene00017648  F20H11.5.1
## 14 WBGene00000301  T13F2.8a.1
## 15 WBGene00018138   F37B4.7.1
## 16 WBGene00009897  F49E12.1.1
## 17 WBGene00021005 W03F11.1a.1
## 18 WBGene00011432   T04D3.2.1
## 19 WBGene00000703    M18.1a.1
## 20 WBGene00010204   F57F5.1.1
##                                                                             description
## 1                                                                                      
## 2                                                                                      
## 3                                                                                      
## 4                           Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6  Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7                                                                                      
## 8        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 10                                                                                     
## 11                                                                                     
## 12                                                                                     
## 13                      D-aspartate oxidase 3  [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 14                                 Caveolin-1  [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 15                  Folate-like transporter 2  [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 16                          Peroxidase skpo-1  [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 17             Uterine Lumin Expressed/locailized  [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 18             SKN-1 Dependent Zygotic transcript  [Source:UniProtKB/TrEMBL;Acc:O02297]
## 19                                       COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q21556]
## 20                                                                                     
##     ext_gene entrezgene_id         pval         qval          b       se_b
## 1    C44B7.5        174124 7.057950e-85 1.118332e-80 -1.5171989 0.07771507
## 2    F15D4.5        174969 1.613360e-81 1.278185e-77  3.8747068 0.20261631
## 3    T04G9.7        180424 1.576815e-60 8.328212e-57 -1.4495391 0.08832342
## 4      vet-2        172993 1.408001e-49 5.577445e-46 -1.3280453 0.08971666
## 5     col-81        174686 3.739522e-46 1.185054e-42  2.0303425 0.14235375
## 6     dod-19        178564 6.131492e-41 1.619225e-37  1.1263589 0.08406315
## 7  Y37D8A.19        176711 3.821684e-39 8.650655e-36 -0.9832194 0.07511980
## 8    pes-2.1        173025 4.000704e-38 7.043462e-35 -1.7272661 0.13380245
## 9    pes-2.2        173026 4.000704e-38 7.043462e-35 -1.7272661 0.13380245
## 10    sepa-1        173196 8.901306e-38 1.410412e-34 -1.5304891 0.11912883
## 11   B0513.4        178351 1.780045e-37 2.564074e-34 -1.1603081 0.09069434
## 12   F48E3.4        181006 1.672903e-36 2.208929e-33 -0.9861263 0.07815018
## 13     ddo-3        184746 6.915372e-36 8.428775e-33 -0.9699391 0.07755740
## 14     cav-1        177815 1.116744e-33 1.263915e-30 -1.1029848 0.09119031
## 15    folt-2        178745 3.081094e-32 3.254662e-29 -0.8779173 0.07427461
## 16    skpo-1        174340 5.041138e-31 4.992302e-28 -0.8518164 0.07354191
## 17     ule-1        171778 2.273631e-29 2.119158e-26 -0.8533348 0.07584102
## 18    sdz-30        173198 2.532942e-29 2.229693e-26 -1.8614224 0.16557609
## 19   col-129        178155 5.864490e-29 4.890676e-26  1.8856724 0.16884974
## 20   F57F5.1        179645 2.536609e-28 2.009629e-25 -0.8851310 0.08019750
##    mean_obs   var_obs     tech_var      sigma_sq smooth_sigma_sq final_sigma_sq
## 1  6.017851 0.6664751 0.0030782318  7.178511e-03     0.009001033    0.009001033
## 2  4.148146 4.3348560 0.0594555773 -6.574551e-03     0.022651161    0.022651161
## 3  5.336970 0.6071586 0.0056274997  2.336322e-03     0.009974555    0.009974555
## 4  5.606577 0.5104205 0.0066373958  9.517097e-04     0.009460763    0.009460763
## 5  5.464653 1.2125366 0.0069123283  3.361685e-02     0.009669959    0.033616853
## 6  5.680775 0.3666408 0.0047652695  8.746247e-05     0.009367957    0.009367957
## 7  6.422207 0.2858795 0.0016874835  9.598486e-03     0.009127380    0.009598486
## 8  4.532240 0.8831048 0.0075861955  2.822000e-02     0.014554710    0.028219996
## 9  4.532240 0.8831048 0.0075861955  2.822000e-02     0.014554710    0.028219996
## 10 4.525542 0.6886811 0.0137410161  8.921226e-03     0.014642339    0.014642339
## 11 5.248247 0.3947901 0.0061483985  5.668441e-03     0.010302527    0.010302527
## 12 6.154206 0.2883113 0.0021870626  1.002784e-02     0.008995154    0.010027837
## 13 5.835642 0.2757498 0.0028869321  5.227220e-03     0.009143370    0.009143370
## 14 5.226113 0.3589810 0.0062272018  7.058848e-03     0.010404143    0.010404143
## 15 7.013395 0.2272664 0.0009177897  7.313406e-03     0.010115646    0.010115646
## 16 6.620497 0.2086972 0.0014829410  1.334146e-04     0.009333883    0.009333883
## 17 6.022399 0.2177003 0.0025040739  8.752858e-03     0.008999647    0.008999647
## 18 4.018878 1.0323351 0.0278325853  2.159394e-02     0.026998299    0.026998299
## 19 5.713564 1.0648062 0.0052853793  5.173509e-02     0.009324889    0.051735089
## 20 7.583655 0.2299978 0.0005303502  6.648096e-03     0.012332928    0.012332928
write.table(sleuth_significant_txn$target_id, file="results/sleuth_significant_TXN_q005.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")

#the transcript table contains beta values too so one can talk about up and down regulated transcripts and genes
sleuth_significant_txn.up <- dplyr::filter(sleuth_table_txn, (qval <= 0.05) & (b>0))
head(sleuth_significant_txn.up, 20)
##          ens_gene    target_id
## 1  WBGene00008862    F15D4.5.1
## 2  WBGene00000657    F38A3.1.1
## 3  WBGene00022644    ZK6.10a.1
## 4  WBGene00000703     M18.1a.1
## 5  WBGene00017186    F07B7.2.1
## 6  WBGene00021107    W09B7.2.1
## 7  WBGene00000712    F41F3.4.1
## 8  WBGene00001399    F10D2.9.1
## 9  WBGene00017071    D2096.3.1
## 10 WBGene00000998  K07E12.1a.1
## 11 WBGene00019617    K10C2.1.1
## 12 WBGene00010605   K06G5.1a.3
## 13 WBGene00001758 Y45G12C.2a.1
## 14 WBGene00012445   Y16B4A.2.1
## 15 WBGene00010822   M01G12.9.1
## 16 WBGene00020128     R193.2.1
## 17 WBGene00012018   T25C12.3.1
## 18 WBGene00022497 Y119D3B.21.1
## 19 WBGene00003515    K12F2.1.1
## 20 WBGene00021050      W05H9.3
##                                                                             description
## 1                                                                                      
## 2                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 3  Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 4                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q21556]
## 5                                                                                      
## 6                                                                                      
## 7                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20282]
## 8        Delta(9)-fatty-acid desaturase fat-7  [Source:UniProtKB/Swiss-Prot;Acc:G5EGH6]
## 9                   Acid Alpha Glucosidase Relate  [Source:UniProtKB/TrEMBL;Acc:Q19004]
## 10                                 Mesocentin  [Source:UniProtKB/Swiss-Prot;Acc:Q09165]
## 11                               Carboxypeptidase  [Source:UniProtKB/TrEMBL;Acc:Q94269]
## 12                                                                                     
## 13             Glutathione S-transferase P 10  [Source:UniProtKB/Swiss-Prot;Acc:Q9N4X8]
## 14                               Carboxypeptidase  [Source:UniProtKB/TrEMBL;Acc:K8ESM2]
## 15                                                                                     
## 16                                                                                     
## 17                                                                                     
## 18                                                                                     
## 19                                   Myosin-3  [Source:UniProtKB/Swiss-Prot;Acc:P12844]
## 20                                                                                     
##      ext_gene entrezgene_id         pval         qval         b       se_b
## 1     F15D4.5        174969 1.613360e-81 1.278185e-77 3.8747068 0.20261631
## 2      col-81        174686 3.739522e-46 1.185054e-42 2.0303425 0.14235375
## 3      dod-19        178564 6.131492e-41 1.619225e-37 1.1263589 0.08406315
## 4     col-129        178155 5.864490e-29 4.890676e-26 1.8856724 0.16884974
## 5     F07B7.2        179263 3.689346e-28 2.657167e-25 3.6706694 0.33360111
## 6     W09B7.2        189307 3.689346e-28 2.657167e-25 3.6706694 0.33360111
## 7     col-139        178857 5.933925e-28 4.087959e-25 1.9426318 0.17724348
## 8       fat-7        179100 7.520251e-28 4.964932e-25 0.8585808 0.07848960
## 9      aagr-1        177632 2.947247e-26 1.796120e-23 0.8908323 0.08403264
## 10      dig-1        175951 5.555927e-24 2.839795e-21 1.0040401 0.09941544
## 11    K10C2.1        180915 6.912340e-24 3.422688e-21 0.7501839 0.07443783
## 12    K06G5.1        181531 2.726013e-23 1.308899e-20 0.8157480 0.08204864
## 13     gst-10        178725 1.757955e-22 8.192588e-20 0.9915736 0.10164890
## 14   Y16B4A.2        181572 1.881803e-22 8.519191e-20 0.7677345 0.07875834
## 15   M01G12.9        187385 5.564544e-22 2.449172e-19 2.1503128 0.22312457
## 16     R193.2        182256 1.774988e-21 7.601264e-19 0.8780899 0.09226115
## 17   T25C12.3        259727 2.854599e-21 1.190293e-18 0.7502703 0.07924338
## 18 Y119D3B.21        246006 2.328616e-20 8.999250e-18 0.7831914 0.08470518
## 19      myo-3        179676 9.198664e-20 3.389601e-17 1.0465034 0.11502529
## 20    W05H9.3          <NA> 1.125043e-19 3.961402e-17 0.9587441 0.10563358
##    mean_obs   var_obs     tech_var      sigma_sq smooth_sigma_sq final_sigma_sq
## 1  4.148146 4.3348560 0.0594555773 -6.574551e-03     0.022651161    0.022651161
## 2  5.464653 1.2125366 0.0069123283  3.361685e-02     0.009669959    0.033616853
## 3  5.680775 0.3666408 0.0047652695  8.746247e-05     0.009367957    0.009367957
## 4  5.713564 1.0648062 0.0052853793  5.173509e-02     0.009324889    0.051735089
## 5  3.152944 4.0404434 0.0876218665  1.349575e-01     0.101478562    0.134957538
## 6  3.152944 4.0404434 0.0876218665  1.349575e-01     0.101478562    0.134957538
## 7  5.842900 1.1320885 0.0042960948  5.853441e-02     0.009134034    0.058534409
## 8  6.251220 0.2211785 0.0024208098  9.900426e-03     0.009027147    0.009900426
## 9  5.704640 0.2334873 0.0047861013  3.088382e-03     0.009336868    0.009336868
## 10 6.277316 0.3049706 0.0101260524  9.640806e-03     0.009039492    0.009640806
## 11 6.284922 0.1665729 0.0020386466  4.704482e-03     0.009043333    0.009043333
## 12 6.586667 0.1928418 0.0041737452 -1.006586e-03     0.009290213    0.009290213
## 13 5.038836 0.2986324 0.0081053606  1.255964e-02     0.011227695    0.012559638
## 14 5.856078 0.1750457 0.0032878082  4.460140e-03     0.009117944    0.009117944
## 15 3.599120 1.3319882 0.0494310278 -3.672657e-02     0.050138118    0.050138118
## 16 5.217482 0.2237042 0.0065780359 -2.603720e-03     0.010446203    0.010446203
## 17 7.117130 0.1715950 0.0008316071  1.172742e-02     0.010414013    0.011727418
## 18 6.711261 0.1875539 0.0013701820  1.297975e-02     0.009468767    0.012979752
## 19 5.960508 0.3355869 0.0033371772  2.312446e-02     0.009026752    0.023124459
## 20 5.586875 0.2817546 0.0042322920  1.808462e-02     0.009485827    0.018084615
#we will save the gene ids instead of transcript ones
write.table(unique(sleuth_significant_txn.up$ens_gene), file="results/sleuth_significant_q005_up.txt", quote=FALSE, row.names = FALSE, col.names = "#ens_gene")
sleuth_significant_txn.down <- dplyr::filter(sleuth_table_txn, (qval <= 0.05) & (b<0))
head(sleuth_significant_txn.down, 20)
##          ens_gene   target_id
## 1  WBGene00016627   C44B7.5.1
## 2  WBGene00020218   T04G9.7.1
## 3  WBGene00016453  C35E7.1a.1
## 4  WBGene00012557 Y37D8A.19.1
## 5  WBGene00003977   F56G4.2.1
## 6  WBGene00010158   F56G4.3.1
## 7  WBGene00010808   M01E5.6.1
## 8  WBGene00007196  B0513.4a.1
## 9  WBGene00018605   F48E3.4.1
## 10 WBGene00017648  F20H11.5.1
## 11 WBGene00000301  T13F2.8a.1
## 12 WBGene00018138   F37B4.7.1
## 13 WBGene00009897  F49E12.1.1
## 14 WBGene00021005 W03F11.1a.1
## 15 WBGene00011432   T04D3.2.1
## 16 WBGene00010204   F57F5.1.1
## 17 WBGene00018393   F43E2.5.2
## 18 WBGene00017490  F15E11.1.1
## 19 WBGene00017500 F15E11.14.1
## 20 WBGene00000785   W07B8.5.1
##                                                                         description
## 1                                                                                  
## 2                                                                                  
## 3                       Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 4                                                                                  
## 5    Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 6    Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 7                                                                                  
## 8                                                                                  
## 9                                                                                  
## 10                  D-aspartate oxidase 3  [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 11                             Caveolin-1  [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 12              Folate-like transporter 2  [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 13                      Peroxidase skpo-1  [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 14         Uterine Lumin Expressed/locailized  [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 15         SKN-1 Dependent Zygotic transcript  [Source:UniProtKB/TrEMBL;Acc:O02297]
## 16                                                                                 
## 17           Methionine Sulfoxide Reductase A  [Source:UniProtKB/TrEMBL;Acc:O02089]
## 18          Protein Up-regulated in Daf-2(Gf)  [Source:UniProtKB/TrEMBL;Acc:G5EBF3]
## 19          Protein Up-regulated in Daf-2(Gf)  [Source:UniProtKB/TrEMBL;Acc:G5EBF3]
## 20 Cathepsin B-like cysteine proteinase 5  [Source:UniProtKB/Swiss-Prot;Acc:P43509]
##     ext_gene entrezgene_id         pval         qval          b       se_b
## 1    C44B7.5        174124 7.057950e-85 1.118332e-80 -1.5171989 0.07771507
## 2    T04G9.7        180424 1.576815e-60 8.328212e-57 -1.4495391 0.08832342
## 3      vet-2        172993 1.408001e-49 5.577445e-46 -1.3280453 0.08971666
## 4  Y37D8A.19        176711 3.821684e-39 8.650655e-36 -0.9832194 0.07511980
## 5    pes-2.1        173025 4.000704e-38 7.043462e-35 -1.7272661 0.13380245
## 6    pes-2.2        173026 4.000704e-38 7.043462e-35 -1.7272661 0.13380245
## 7     sepa-1        173196 8.901306e-38 1.410412e-34 -1.5304891 0.11912883
## 8    B0513.4        178351 1.780045e-37 2.564074e-34 -1.1603081 0.09069434
## 9    F48E3.4        181006 1.672903e-36 2.208929e-33 -0.9861263 0.07815018
## 10     ddo-3        184746 6.915372e-36 8.428775e-33 -0.9699391 0.07755740
## 11     cav-1        177815 1.116744e-33 1.263915e-30 -1.1029848 0.09119031
## 12    folt-2        178745 3.081094e-32 3.254662e-29 -0.8779173 0.07427461
## 13    skpo-1        174340 5.041138e-31 4.992302e-28 -0.8518164 0.07354191
## 14     ule-1        171778 2.273631e-29 2.119158e-26 -0.8533348 0.07584102
## 15    sdz-30        173198 2.532942e-29 2.229693e-26 -1.8614224 0.16557609
## 16   F57F5.1        179645 2.536609e-28 2.009629e-25 -0.8851310 0.08019750
## 17    msra-1        185709 4.221976e-27 2.675888e-24 -1.3144354 0.12191886
## 18   pud-2.1        178713 2.139720e-25 1.210852e-22 -0.8263971 0.07935422
## 19   pud-2.2        178710 2.139720e-25 1.210852e-22 -0.8263971 0.07935422
## 20     cpr-5        178612 2.938093e-25 1.605313e-22 -0.8746309 0.08423025
##    mean_obs   var_obs     tech_var     sigma_sq smooth_sigma_sq final_sigma_sq
## 1  6.017851 0.6664751 0.0030782318 0.0071785109     0.009001033    0.009001033
## 2  5.336970 0.6071586 0.0056274997 0.0023363218     0.009974555    0.009974555
## 3  5.606577 0.5104205 0.0066373958 0.0009517097     0.009460763    0.009460763
## 4  6.422207 0.2858795 0.0016874835 0.0095984858     0.009127380    0.009598486
## 5  4.532240 0.8831048 0.0075861955 0.0282199959     0.014554710    0.028219996
## 6  4.532240 0.8831048 0.0075861955 0.0282199959     0.014554710    0.028219996
## 7  4.525542 0.6886811 0.0137410161 0.0089212257     0.014642339    0.014642339
## 8  5.248247 0.3947901 0.0061483985 0.0056684408     0.010302527    0.010302527
## 9  6.154206 0.2883113 0.0021870626 0.0100278374     0.008995154    0.010027837
## 10 5.835642 0.2757498 0.0028869321 0.0052272197     0.009143370    0.009143370
## 11 5.226113 0.3589810 0.0062272018 0.0070588477     0.010404143    0.010404143
## 12 7.013395 0.2272664 0.0009177897 0.0073134062     0.010115646    0.010115646
## 13 6.620497 0.2086972 0.0014829410 0.0001334146     0.009333883    0.009333883
## 14 6.022399 0.2177003 0.0025040739 0.0087528577     0.008999647    0.008999647
## 15 4.018878 1.0323351 0.0278325853 0.0215939443     0.026998299    0.026998299
## 16 7.583655 0.2299978 0.0005303502 0.0066480959     0.012332928    0.012332928
## 17 5.939860 0.5161979 0.0206884267 0.0056289443     0.009039990    0.009039990
## 18 5.524850 0.2059185 0.0023758220 0.0102183621     0.009571054    0.010218362
## 19 5.524850 0.2059185 0.0023758220 0.0102183621     0.009571054    0.010218362
## 20 6.393223 0.2307279 0.0019802600 0.0122092099     0.009107001    0.012209210
write.table(unique(sleuth_significant_txn.down$ens_gene), file="results/sleuth_significant_q005_down.txt", quote=FALSE, row.names = FALSE, col.names = "#ens_gene")

#Summarise some of the numbers
#all transcripts and genes considered in the analysis
length(sleuth_table_txn$target_id)
## [1] 61428
length(sleuth_table_wt$target_id)
## [1] 46904
length(unique(sleuth_table_wt$target_id))
## [1] 46904
#transcripts and genes with sleuth data
sum(sleuth_table_wt$num_aggregated_transcripts, na.rm=TRUE )
## [1] 15845
length(sleuth_table_wt[!is.na(sleuth_table_wt$num_aggregated_transcripts),] $target_id)
## [1] 11337
#significantly DE genes (aggregated) at q<0.05
length(sleuth_significant_wt$target_id)
## [1] 2339
#significantly DE transcripts
length(sleuth_significant_txn$target_id)
## [1] 2263
#Some additional exploratory analysis using sleuth's own functions
plot_pca(so,
         color_by = 'condition',
         text_labels = TRUE)

#note that using tpm as unit does not result in a great separation along the PC1 
#the two types of samples are separated but only just (the SGs cluster together though)
plot_pca(so,
          color_by = 'condition',
          text_labels = TRUE, units="tpm")

plot_pca(so,
         pc_x=2,
         pc_y=3,
         color_by = 'condition',
         text_labels = TRUE)

plot_pca(so,
         pc_x=2,
         pc_y=8,
         color_by = 'condition',
         text_labels = TRUE)

#how much of the variance is accounted for by each PC?
plot_pc_variance(obj = so, units = "est_counts")

plot_pc_variance(obj = so, units = "tpm")

#note PC8 also separates the samples
plot_loadings(so, pc_input=1)

plot_loadings(so, pc_input=2)

plot_loadings(so, pc_input=3)

plot_loadings(so, pc_input=8) 

#note that all influential genes in PCA seem to be very highly expressed (over 1000 tpm)

plot_group_density(so,
                   use_filtered = FALSE,
                   units = "est_counts",
                   trans = "log",
                   grouping = "condition")
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `summarise_()` is deprecated as of dplyr 0.7.0.
## Please use `summarise()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

plot_group_density(so,
                   use_filtered = TRUE,
                   units = "est_counts",
                   trans = "log",
                   grouping = "condition")

plot_sample_heatmap(so)

#check whether there are sex differences
#list of sex determining genes is from: 
#https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2619291/pdf/324.pdf
list.sex_genes <- c("xol-1", "sdc-1", "sdc-2", 
                    "her-1", "tra-2", "tra-3", 
                    "fem-1", "fem-2", "fem-3", "tra-1")

plot_transcript_heatmap(so, 
                        transcripts=dplyr::filter(sleuth_table_txn, 
                                                  ext_gene %in% list.sex_genes)$target_id)

#repeat the plot with est_counts instead of tpm
plot_transcript_heatmap(so, 
                        transcripts=dplyr::filter(sleuth_table_txn, 
                                                  ext_gene %in% list.sex_genes)$target_id,
                        units="est_counts")


#ZK287.8b.1 seems to separate two of the normal samples but it's a transcript that 
#is not used in differential expression due to filtering - below are original counts
#note that basic filter in sleuth filters out any transcripts with fewer than 5 estimated counts
#in 47% of the samples (effectively here half of the samples)
so$obs_raw[so$obs_norm$target_id == "ZK287.8b.1",]
##         target_id est_counts  eff_len len      tpm sample
## 480169 ZK287.8b.1   0.000000 54.84584 196  0.00000  N1_S5
## 480170 ZK287.8b.1   3.000000 51.71815 196 11.69946  N2_S6
## 480171 ZK287.8b.1   3.825966 52.92062 196 16.27609  N3_S7
## 480172 ZK287.8b.1   0.000000 51.52067 196  0.00000  N4_S8
## 480173 ZK287.8b.1   0.000000 55.92125 196  0.00000 SG1_S1
## 480174 ZK287.8b.1   0.000000 53.04100 196  0.00000 SG2_S2
## 480175 ZK287.8b.1   0.000000 54.34783 196  0.00000 SG3_S3
## 480176 ZK287.8b.1   0.000000 53.99656 196  0.00000 SG4_S4
#volcano plot for genes
#Note: there is a problem with volcano and MA plots not setting hte p-value aggregate to FALSE when reading
#results from the WT test and thus not realising there is a b column
#Error reported is: 
#Error in FUN(X[[i]], ...) : object 'b' not found
#To fix this, manually set the pval_aggregate in so before supplying to plots as suggested here:
#https://github.com/pachterlab/sleuth/issues/233
# so$pval_aggregate <- FALSE
# plot_volcano(obj=so, 
        # test = "conditionneurexin", 
        # test_type = "wt", 
        # which_model="full", 
        # sig_level = 0.05, 
        # sig_color = "red")
#can also do MA plot
# plot_ma(obj=so, 
        # test = "conditionneurexin", 
        # test_type = "wt", 
        # which_model="full", 
        # sig_level = 0.05, 
        # sig_color = "red")
# ALTERNATIVELY
#use the EnhancedVolcano package and function to make nicer plots and to also highlight 
#genes easily (the highlight function in plot_volcano from sleuth is not working for me)

#not used as using cutoffs within the volcano plot options
#select.lab <- head(sleuth_significant_txn, 20)$target_id

#use the same function that plot_volcano() uses to produce the data frame needed to make the plot
toptable <- sleuth_results(so, 
                           test="conditionneurexin", 
                           test_type="wt", 
                           which_model="full",
                           pval_aggregate=FALSE,
                           rename_cols=FALSE,
                           show_all=FALSE)
toptable <- dplyr::mutate(toptable, significant = qval < 0.05)
dim(dplyr::filter(toptable, abs(b)>1))
## [1] 2045   16
#there are 2045 transcripts

EnhancedVolcano(
     toptable=toptable,
     lab=toptable$target_id,
     x='b',
     y='qval',
     xlab="beta_value",
     selectLab = NULL,
     title="Sleuth results (without p-value aggregation)",
     subtitle="",
     legend=c("NS","beta","P","P & beta"),
     legendLabels=c("NS","beta","P","P & beta"),
     transcriptLabSize=5.0,
     pLabellingCutoff=10e-6,
     FCcutoff=1.0
     )
## Warning in EnhancedVolcano(toptable = toptable, lab = toptable$target_id, :
## transcriptLabSize argument deprecated in v1.4 - please use labSize
#create a table with information about top 10 most significant genes
select.res <- dplyr::filter(toptable, abs(b)>1)[1:10,c(1,2,5,6,7)]
dplyr::filter(t2g, target_id %in% select.res$target_id)
##          ens_gene  target_id
## 1  WBGene00000657  F38A3.1.1
## 2  WBGene00003977  F56G4.2.1
## 3  WBGene00007196 B0513.4a.1
## 4  WBGene00008862  F15D4.5.1
## 5  WBGene00010158  F56G4.3.1
## 6  WBGene00010808  M01E5.6.1
## 7  WBGene00016453 C35E7.1a.1
## 8  WBGene00016627  C44B7.5.1
## 9  WBGene00020218  T04G9.7.1
## 10 WBGene00022644  ZK6.10a.1
##                                                                             description
## 1                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 2        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 3                                                                                      
## 4                                                                                      
## 5        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 6                                                                                      
## 7                           Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 8                                                                                      
## 9                                                                                      
## 10 Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
##    ext_gene entrezgene_id
## 1    col-81        174686
## 2   pes-2.1        173025
## 3   B0513.4        178351
## 4   F15D4.5        174969
## 5   pes-2.2        173026
## 6    sepa-1        173196
## 7     vet-2        172993
## 8   C44B7.5        174124
## 9   T04G9.7        180424
## 10   dod-19        178564
merge(x=select.res, y=dplyr::filter(t2g, target_id %in% select.res$target_id), by="target_id", sort=FALSE, no.dups=TRUE)[,c("target_id","ext_gene","entrezgene_id.x","qval","description")]
##     target_id ext_gene entrezgene_id.x         qval
## 1   C44B7.5.1  C44B7.5          174124 1.118332e-80
## 2   F15D4.5.1  F15D4.5          174969 1.278185e-77
## 3   T04G9.7.1  T04G9.7          180424 8.328212e-57
## 4  C35E7.1a.1    vet-2          172993 5.577445e-46
## 5   F38A3.1.1   col-81          174686 1.185054e-42
## 6   ZK6.10a.1   dod-19          178564 1.619225e-37
## 7   F56G4.2.1  pes-2.1          173025 7.043462e-35
## 8   F56G4.3.1  pes-2.2          173026 7.043462e-35
## 9   M01E5.6.1   sepa-1          173196 1.410412e-34
## 10 B0513.4a.1  B0513.4          178351 2.564074e-34
##                                                                             description
## 1                                                                                      
## 2                                                                                      
## 3                                                                                      
## 4                           Very Early Transcript  [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5                                        COLlagen  [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6  Downstream Of DAF-16 (Regulated by DAF-16)  [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 8        Patterned Expression Site; Pes-2 protein  [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9                                                                                      
## 10
#Examine some bootstrapped counts for individual transcripts of the same gene
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[1], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[2], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[3], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[4], units = "est_counts", color_by = "condition")

#use the shiny app when not knitting the document
#sleuth_live(so)

Session information

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] PCAtools_1.2.0              cowplot_1.0.0              
##  [3] lattice_0.20-41             reshape2_1.4.4             
##  [5] apeglm_1.8.0                pheatmap_1.0.12            
##  [7] DESeq2_1.26.0               SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [11] matrixStats_0.56.0          EnhancedVolcano_1.4.0      
## [13] ggrepel_0.8.2               Rgraphviz_2.30.0           
## [15] topGO_2.38.1                SparseM_1.78               
## [17] graph_1.64.0                org.Ce.eg.db_3.10.1        
## [19] GO.db_3.10.0                goseq_1.38.0               
## [21] geneLenDataBase_1.22.0      BiasedUrn_1.07             
## [23] reactome.db_1.70.0          AnnotationDbi_1.48.0       
## [25] Biobase_2.46.0              biomaRt_2.42.1             
## [27] ggplot2_3.3.2               dplyr_1.0.0                
## [29] tximport_1.14.2             sleuth_0.30.0              
## [31] tidyr_1.1.0                 rtracklayer_1.46.0         
## [33] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [35] IRanges_2.20.2              S4Vectors_0.24.4           
## [37] BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.8          Hmisc_4.4-0              BiocFileCache_1.10.2    
##   [4] plyr_1.8.6               lazyeval_0.2.2           splines_3.6.3           
##   [7] digest_0.6.25            htmltools_0.5.0          magrittr_1.5            
##  [10] checkmate_2.0.0          memoise_1.1.0            cluster_2.1.0           
##  [13] Biostrings_2.54.0        annotate_1.64.0          askpass_1.1             
##  [16] bdsmatrix_1.3-4          prettyunits_1.1.1        jpeg_0.1-8.1            
##  [19] colorspace_1.4-1         blob_1.2.1               rappdirs_0.3.1          
##  [22] xfun_0.14                crayon_1.3.4             RCurl_1.98-1.2          
##  [25] genefilter_1.68.0        survival_3.2-3           glue_1.4.1              
##  [28] gtable_0.3.0             zlibbioc_1.32.0          XVector_0.26.0          
##  [31] BiocSingular_1.2.2       Rhdf5lib_1.8.0           scales_1.1.1            
##  [34] mvtnorm_1.1-1            DBI_1.1.0                Rcpp_1.0.4              
##  [37] xtable_1.8-4             progress_1.2.2           emdbook_1.3.12          
##  [40] htmlTable_2.0.0          dqrng_0.2.1              rsvd_1.0.3              
##  [43] foreign_0.8-75           bit_1.1-15.2             Formula_1.2-3           
##  [46] htmlwidgets_1.5.1        httr_1.4.1               RColorBrewer_1.1-2      
##  [49] acepack_1.4.1            ellipsis_0.3.1           pkgconfig_2.0.3         
##  [52] XML_3.99-0.3             farver_2.0.3             nnet_7.3-14             
##  [55] dbplyr_1.4.4             locfit_1.5-9.4           tidyselect_1.1.0        
##  [58] labeling_0.3             rlang_0.4.6              munsell_0.5.0           
##  [61] tools_3.6.3              generics_0.0.2           RSQLite_2.2.0           
##  [64] evaluate_0.14            stringr_1.4.0            yaml_2.2.1              
##  [67] knitr_1.28               bit64_0.9-7              purrr_0.3.4             
##  [70] nlme_3.1-148             compiler_3.6.3           rstudioapi_0.11         
##  [73] curl_4.3                 png_0.1-7                tibble_3.0.1            
##  [76] geneplotter_1.64.0       stringi_1.4.6            GenomicFeatures_1.38.2  
##  [79] Matrix_1.2-18            vctrs_0.3.1              pillar_1.4.4            
##  [82] lifecycle_0.2.0          irlba_2.3.3              data.table_1.12.8       
##  [85] bitops_1.0-6             R6_2.4.1                 latticeExtra_0.6-29     
##  [88] gridExtra_2.3            MASS_7.3-51.6            assertthat_0.2.1        
##  [91] rhdf5_2.30.1             openssl_1.4.1            withr_2.2.0             
##  [94] GenomicAlignments_1.22.1 Rsamtools_2.2.3          GenomeInfoDbData_1.2.2  
##  [97] mgcv_1.8-31              hms_0.5.3                rpart_4.1-15            
## [100] coda_0.19-3              rmarkdown_2.3            DelayedMatrixStats_1.8.0
## [103] aggregation_1.0.1        bbmle_1.0.23.1           numDeriv_2016.8-1.1     
## [106] base64enc_0.1-3
#save.image(file="neurexin_analysis.Rdata")